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Abstract 

Deep  neural  networks  have  become  increasingly  successful  at 
solving  classic  perception  problems  such  as  object  recognition, 
semantic  segmentation,  and  scene  understanding,  often  reach¬ 
ing  or  surpassing  human-level  accuracy.  This  success  is  due 
in  part  to  the  ability  of  DNNs  to  learn  useful  representations 
of  high-dimensional  inputs,  a  problem  that  humans  must  also 
solve.  We  examine  the  relationship  between  the  representa¬ 
tions  learned  by  these  networks  and  human  psychological  rep¬ 
resentations  recovered  from  similarity  judgments.  We  find  that 
deep  features  learned  in  service  of  object  classification  account 
for  a  significant  amount  of  the  variance  in  human  similarity 
judgments  for  a  set  of  animal  images.  However,  these  fea¬ 
tures  do  not  capture  some  qualitative  distinctions  that  are  a  key 
part  of  human  representations.  To  remedy  this,  we  develop  a 
method  for  adapting  deep  features  to  align  with  human  sim¬ 
ilarity  judgments,  resulting  in  image  representations  that  can 
potentially  be  used  to  extend  the  scope  of  psychological  exper¬ 
iments. 
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Introduction 

The  resurgence  of  neural  networks  in  the  form  of  deep  learn¬ 
ing  has  continued  to  dominate  object  recognition  benchmarks 
in  the  field  of  computer  vision,  often  attaining  near  or  above 
human-level  accuracy  for  a  variety  of  perceptual  tasks,  most 
notably  through  recent  advances  in  classifying  thousands  of 
objects  within  natural  images  (Krizhevsky,  Sutskever,  &  Hin¬ 
ton,  2012;  He,  Zhang,  Ren,  &  Sun,  2015).  Part  of  the  suc¬ 
cess  of  these  models  is  due  to  their  ability  to  learn  effective 
feature  representations  of  high-dimensional  inputs  (e.g.,  com¬ 
plex  color  images);  a  challenge  that  human  perception  must 
also  confront  (Austerweil  &  Griffiths,  2013).  As  a  result, 
cognitive  scientists  have  started  to  explore  how  the  represen¬ 
tations  learned  by  these  networks  can  be  used  in  models  of 
human  behavior  for  perceptual  tasks  such  as  predicting  the 
memorability  of  objects  in  images  (Dubey,  Peterson,  Khosla, 
Yang,  &  Ghanem,  2015)  and  predicting  judgments  of  cate¬ 
gory  typicality  (Lake,  Zaremba,  Fergus,  &  Gureckis,  2015). 

While  deep  learning  models  continue  to  mimic  a  growing 
list  of  human-like  abilities,  a  number  of  core  questions  re¬ 
main  unanswered  about  the  relevance  of  these  models  to  ac¬ 
tual  human  cognition  and  perception.  For  instance,  features 
of  the  input  learned  using  these  networks  excel  in  predicting 
certain  human  judgments,  but  how  are  these  feature  represen¬ 
tations  related  to  human  psychological  representations?  At 
first  glance,  it  would  seem  that  the  ability  of  these  represen¬ 
tations  to  predict  typicality  judgments  and  stimulus  memora¬ 
bility  would  constitute  robust  evidence  of  their  relevance  to 
people,  however  recent  work  has  shown  that  neural  networks 


that  classify  images  can  be  systematically  deceived  by  imper¬ 
ceptible  image  transformations  (Szegedy  et  al.,  2013),  casting 
doubt  on  their  similarity  to  humans. 

Understanding  the  relationship  between  the  representations 
found  by  deep  learning  and  those  of  humans  is  an  important 
question  in  cognitive  science,  and  could  potentially  benefit 
artificial  intelligence.  However,  independent  of  this  question, 
simply  having  a  good  approximation  to  how  people  represent 
images  would  allow  cognitive  scientists  to  test  psychological 
theories  using  complex,  realistic  stimuli.  Indeed,  tasks  such 
as  creating  stimulus  sets  that  uniformly  span  psychological 
space  are  far  from  trivial. 

In  this  paper,  we  address  this  question  directly  by  exam¬ 
ining  how  well  features  extracted  from  state-of-the-art  deep 
neural  networks  predict  human  similarity  judgments.  An  ini¬ 
tial  evaluation  shows  that  these  features  account  for  a  sig¬ 
nificant  amount  of  variance  in  human  judgments,  but  fail  to 
capture  qualitative  distinctions  that  are  key  to  human  repre¬ 
sentations.  We  then  develop  a  method  for  adapting  deep  net¬ 
work  features  to  better  predict  human  similarity  judgments, 
and  show  that  this  approach  can  reproduce  those  qualitative 
distinctions.  These  results  suggest  that  while  raw  features 
produced  by  deep  learning  may  not  be  suitable  for  use  in 
modeling  cognition,  they  can  be  modified  to  bring  them  into 
close  alignment  with  human  representations. 

Deep  Representations 

In  general,  deep  neural  networks  (DNN)  are  neural  networks 
that  have  depth  in  terms  of  their  number  of  hidden  layers  be¬ 
tween  input  and  output  (Bengio,  2009).  In  the  past  few  years, 
training  such  networks  to  understand  aspects  of  large,  com¬ 
plex  data  sets  has  led  to  a  number  of  advances  in  vision  and 
language  applications  (LeCun,  Bengio,  &  Hinton,  2015). 

In  computer  vision,  the  majority  of  this  progress  has  been 
driven  by  a  particular  DNN  called  a  convolutional  neural  net¬ 
work  (CNN)  (LeCun  et  al.,  1989).  CNNs  get  their  name  from 
the  use  of  convolutional  layers,  which  learn  a  set  of  image 
filters  that  produce  feature  maps  of  spatially-organized  inputs 
like  images.  This  allows  for  a  drastic  decrease  in  the  num¬ 
ber  of  parameters  the  network  must  learn,  which  would  oth¬ 
erwise  explode  exponentially  in  a  fully  connected  network 
with  high-dimensional  inputs.  The  typical  CNN  architec¬ 
ture  includes  a  series  of  hidden  convolutional  layers,  followed 
by  a  smaller  number  of  fully  connected  layers,  and  finally  a 
layer  that  generates  the  final  output  or  classification.  While 
CNNs  were  initially  developed  over  two  decades  ago,  they 
came  to  mainstream  popularity  in  2012  when  a  7-layer  ar¬ 
chitecture  named  AlexNet  (Krizhevsky,  Sutskever,  &  Hin- 


ton,  2012)  won  the  ImageNet  Large  Scale  Visual  Recognition 
Challenge  (ILSVRC),  reducing  the  previous  winner’s  error 
rate  by  an  uncommonly  large  margin.  Since  then,  a  deeper 
CNN  has  won  the  contest  every  year,  currently  dominated  by 
Microsoft’s  150-layer  network  which  obtained  a  best-of-top- 
5  error  rate  of  4.94%,  surpassing  the  accuracy  of  non-expert 
humans  at  5.1%  (He,  Zhang,  Ren,  &  Sun,  2015). 

Interestingly,  CNNs  produce  much  more  than  just  their 
outputs  (e.g.,  a  category  label  for  an  image);  they  can  also 
return  feature  representations  at  each  layer  of  the  network. 
The  “deep  representations”  learned  by  these  networks  have 
proven  useful  in  predicting  human  behavior.  Dubey,  Peter¬ 
son,  Khosla,  Yang,  &  Ghanem  (2015)  used  representations 
extracted  from  the  last  fully-connected  layer  of  a  CNN  to  pre¬ 
dict  the  intrinsic  memorability  of  objects.  That  is,  the  objects 
that  humans  are  jointly  likely  to  remember  or  forget  in  a  large 
complex  natural  scene  database.  The  correlation  between  es¬ 
timates  of  memorability  and  the  original  memorability  scores 
for  each  object  matched  human  consistency  (i.e.  the  correla¬ 
tion  between  memorability  scores  of  random  splits  of  the  full 
sample  of  subjects).  Similarly,  Lake,  Zaremba,  Fergus,  & 
Gureckis  (2015)  were  able  to  reliably  predict  human  typical¬ 
ity  ratings  of  eight  object  categories  using  the  same  network 
and  features,  and  called  for  cognitive  scientists  to  pay  atten¬ 
tion  to  deep  learning  since  categorization  is  a  foundational 
problem  in  the  field. 

Deep  representations  are  also  beginning  to  interest  the  neu¬ 
roscience  community.  For  example,  CNN  activations  have 
been  used  to  predict  monkey  IT  cortex  activity  (Yamins 
et  al.,  2014),  as  well  as  both  low-  and  high-level  activity 
in  human  visual  areas  (Agrawal,  Stansbury,  Malik,  &  Gal¬ 
lant,  2014).  Delving  deeper,  Khaligh-Razavi  &  Kriegeskorte 
(2014)  found  that  a  CNN  best  explained  IT  cortex  represen¬ 
tations  out  of  a  set  of  37  well-known  models  from  both  the 
computer  vision  and  neuroscience  fields,  although  no  model 
completely  explained  all  of  the  variance,  unsupervised  mod¬ 
els  being  the  worst  of  all  of  them. 

Although  CNN  representations  currently  do  the  best  job  of 
predicting  neural  activity  as  measured  by  Blood  Oxygenation 
Level  Dependent  (BOLD)  response,  this  does  not  guarantee 
that  we  can  explain  psychological  representations  as  a  result. 
In  fact,  Mur  et  al.  (2013)  was  partly  successful  in  predicting 
human  similarity  judgments  (a  classic  index  of  psychologi¬ 
cal  representations)  from  IT  cortex  representations.  However, 
the  key  categorical  distinctions  in  the  human  representations 
were  not  well  predicted:  human  IT  cortex  representations 
were  more  similar  to  monkey  IT  cortex  representations  than 
they  were  to  human  psychological  representations.  In  the  re¬ 
mainder  of  the  paper,  we  use  a  similar  approach  to  evaluate 
how  well  deep  network  features  align  with  human  psycholog¬ 
ical  representations,  and  to  explore  how  the  correspondence 
between  the  two  can  be  increased. 


Evaluating  Representations 

Our  first  step  is  to  evaluate  the  potential  correspondence 
between  deep  network  features  and  psychological  repre¬ 
sentations.  Unlike  neural  representations,  psychological 
representations  cannot  be  measured  directly.  However,  both 
spatial  and  hierarchical  psychological  representations  for  N 
objects  can  be  recovered  given  an  N  x  N  matrix  of  similarity 
judgments  using  methods  such  as  multidimensional  scaling 
and  hierarchical  clustering  (Shepard,  1980).  We  thus  reduce 
the  problem  to  one  of  capturing  human  similarity  judgments, 
subjecting  both  human  judgments  and  model  predictions  to 
these  different  methods  of  extracting  representations.  We 
approach  this  problem  by  taking  the  inner-product  of  the  deep 
feature  representations  of  each  pair  of  images  (a  measure 
of  similarity  between  two  vectors).  We  then  compute  the 
correlation  between  these  pairwise  vector  similarities  and 
human  similarity  judgments  for  the  same  stimulus  pairs, 
which  gives  us  a  measure  of  the  correspondence  we  want  to 
evaluate. 

Stimuli.  Our  stimulus  set  consisted  of  120  color  photographs 
of  animals  (sample  images  are  shown  in  Figure  1).  Images 
were  cropped  to  300  x  300  pixels,  resulting  in  close-ups  of 
either  the  animal’s  face  or  body.  The  set  was  constructed  to 
include  both  inter-  and  intraspecies  variation. 

Behavioral  Experiment.  We  collected  pairwise  similarity 
ratings  for  our  animal  stimulus  set  through  Amazon  Mechan¬ 
ical  Turk.  Participants  were  instructed  to  rate  the  similarity 
of  four  pairs  of  animal  images  on  a  scale  from  0  (not  similar 
at  all)  to  10  (very  similar).  We  paid  workers  $0.02  per  set 
of  four  comparisons.  Before  each  task,  eight  examples  were 
shown  to  help  prevent  bias  in  early  judgments.  Amazon 
workers  could  repeat  the  task  with  new  animal  pairs  as 
many  times  as  they  wanted.  There  were  7, 140  possible 
image  comparisons,  each  of  which  was  rated  by  10  unique 
participants,  for  a  total  of  71,400  ratings  from  209  different 
participants.  The  result  was  a  120  x  120  similarity  matrix 
after  averaging  over  judgments. 

Feature  Extraction.  We  extracted  features  for  each  image 
in  our  data  set  using  three  different  popular  off-the-shelf 
CNNs  of  varying  complexity  that  were  pre trained  in  Caffe 
(Jia  et  al.,  2014).  Specifically,  we  used  CaffeNet  (based  on 
original  AlexNet),  VGG16  (Simonyan  &  Zisserman,  2014), 
and  GoogLeNet  (Szegedy  et  al.,  2014),  the  layer  depths 
of  which  were  7,  16,  and  22  respectively.  GoogLeNet  and 
VGG16  achieve  roughly  half  the  error  rates  of  AlexNet. 
Each  network  had  already  been  trained  to  classify  1000 
object  categories  from  previous  ILSVRC  competitions.  A 
feedforward  pass  of  each  flattened  image  vector  into  each 
network  yields  feature  responses  at  each  layer.  For  our 
analysis,  we  extracted  the  last  layer  of  each  network  before 
the  classification  layer.  For  CaffeNet  and  VGG16,  this  is 
a  4096-dimensional  fully-connected  layer,  while  the  last 


Figure  1:  Samples  from  the  set  of  120  animal  photographs. 


Table  1:  Correlations  between  human  and  deep  similarities. 
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layer  in  GoogleNet  is  a  1000-dimensional  average  pooling 
layer.  Lastly,  we  also  extracted  Histograms  of  Oriented 
Gradients  (HOG)  and  Scale-Invariant  Feature  Transform 
(SIFT)  representations  for  comparison  since  such  features 
represent  the  generic  representations  of  choice  for  tasks  in 
computer  vision  prior  to  the  popularity  of  deep  learning. 

Results 

Table  1  gives  performance  ( R 2)  for  each  model.  Raw  repre¬ 
sentations  from  all  three  networks  show  medium  to  high  cor¬ 
relations  with  the  human  data.  In  general,  deeper  networks 
with  better  ImageNet  classification  accuracy  like  GoogLeNet 
and  VGG16  did  better  than  CaffeNet,  which  is  considerbly 
more  shallow.  The  HOG+SIFT  baseline  did  surprisingly 
poorly,  explaining  very  little  variance  as  compared  to  the  deep 
representations,  suggesting  that  while  these  features  are  use¬ 
ful  for  many  computer  vision  tasks,  they  differ  in  large  part 
from  the  representations  humans  employ  when  judging  ani¬ 
mal  similarity. 

Although  the  VGG  representation  explained  a  fair  amount 
of  variance,  further  analyses  revealed  that  the  most  crucial 
structural  aspects  of  the  human  representations  were  not  pre¬ 
served.  The  first  and  second  panels  of  Figure  2  show  multi¬ 
dimensional  scaling  (MDS)  solutions  for  the  original  human 
data  and  the  predictions  from  the  unaltered  deep  representa¬ 
tions.  While  the  structure  of  the  MDS  solutions  for  the  pre¬ 
dicted  judgments  looks  reasonable  (e.g.,  zebras  are  next  to 
other  zebras),  major  categorical  divisions  are  not  preserved. 
Hierarchical  clusterings  of  the  actual  and  predicted  human 
judgments  (the  first  and  second  panels  of  Figure  3)  show  a 
similar  pattern  of  results:  human  judgments  exhibit  several 
major  categorical  divisions,  whereas  much  of  this  structure  is 
lost  in  the  predicted  data. 


Adapting  Representations 

After  quantifying  the  discrepancy  between  deep  and  human 
representations,  we  can  attempt  to  bring  them  into  closer 
alignment.  First,  consider  that  the  final  hidden  layer  feature 
representation  in  a  neural  network  can  be  thought  of  as 
the  input  to  a  final  linear  classification  layer,  such  that 
the  problem  solved  by  the  final  weight  matrix  is  a  linear 
transformation  (which  is  then  often  scaled  by  a  softmax 
function  to  covert  to  class  probabilities).  This  can  be  thought 
of  as  a  rescaling  of  the  final  stimulus  representation  to  solve 
the  categorization  problem.  This  suggests  that  we  should  not 
think  about  the  features  extracted  by  the  network  as  a  static 
representation,  but  as  the  ingredients  for  a  transformation 
that  solves  a  problem.  Thinking  in  these  terms,  we  show  that 
we  can  easily  solve  for  a  linear  transformation  that  better 
captures  human  similarity  judgments. 

Similarity  Model.  Any  similarity  matrix  S  can  be  decom¬ 
posed  into  the  matrix  product  of  a  feature-by-object  matrix  F, 
its  transpose,  and  a  diagonal  weight  matrix  W, 

S=FWF7'  (1) 

This  formulation  is  similar  to  that  employed  by  additive  clus¬ 
tering  models  (Shepard  &  Arabie,  1979),  wherein  F  repre¬ 
sents  a  binary  feature  identity  matrix  (and  is  similar  to  Tver- 
sky’s  (1977)  model  of  similarity).  When  used  with  continu¬ 
ous  features,  this  approach  is  akin  to  factor  analysis.  Given 
an  existing  feature-by-object  matrix  F,  the  diagonal  of  W  can 
be  solved  for  using  linear  regression  where  the  predictors  for 
each  similarity  ,v;;  are  the  product  of  the  values  of  each  fea¬ 
ture  for  the  objects  i  and  j.  When  W  is  the  identity  matrix, 
this  reduces  to  the  model  evaluated  in  the  previous  section. 

Nf 

Sij  =  YJwkfikfjk-  (2) 

i—  1 

The  result  is  a  convex  optimization  problem  that  can  be 
solved  straightforwardly,  allowing  us  to  find  a  transformation 
of  the  deep  features  with  a  closer  correspondence  to  human 
similarity  judgments. 
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Figure  2:  Multidimensional  scaling  solutions  for  similarity  matrices  obtained  from  human  judgements  (left),  non-transformed 
deep  representations  (center),  and  transformed  deep  representations  (right). 
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Figure  3:  Hierarchical  clustering  of  human  judgements  (top),  deep  representations  (middle),  and  transformed  representations 
(bottom).  Human  judgments  resulted  in  nine  interpretable  clusters,  grouped  by  color  and  semantic  category  label  in  the  top 
panel.  The  leaves  of  the  deep  and  transformed  representation  clusterings  are  color-coded  relative  to  the  human  judgments. 


Table  2:  Model  performance  using  adjusted  CNN  features. 
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Analysis.  With  such  a  large  number  of  predictors,  regular¬ 
ization  is  critical  to  avoid  overfitting.  We  used  ridge  regres¬ 
sion  (L2  regularization)  and  performed  grid  search  on  cross- 
validated  generalization  performance  to  find  the  best  regu¬ 
larization  parameter.  We  predicted  only  the  upper  triangle 
of  the  similarity  matrix  since  the  matrix  is  symmetric.  Each 
model  was  evaluated  via  its  generalization  performance  in  6- 
fold  cross-validation.  We  did  this  for  the  feature  vectors  ex¬ 
tracted  at  each  layer  of  the  network. 

As  an  additional  control  against  overfitting,  we  compared 
model  performance  with  several  baselines.  In  Baseline  1,  we 
shuffled  the  rows  of  the  feature  matrix  such  that  the  feature 
representation  of  one  image  was  replaced  with  that  of  a 
different  randomly  chosen  image.  In  Baseline  2,  the  columns 
of  the  feature  matrix  were  randomly  permuted  for  each  row 
separately.  Lastly,  Baseline  3  simply  combined  the  shuffling 
schemes  from  the  first  two  baselines.  In  all  three  cases,  the 
randomized  feature  matrices  were  subjected  to  the  same  set 
of  analyses  as  the  true  features,  allowing  us  to  check  for 
spurious  correlations. 

Results 

Table  2  shows  performance  for  each  network  using  our  ad¬ 
justment  of  the  representations.  The  R2  values  reported  are 
the  average  values  across  all  six  folds  of  the  crossvalidation. 
All  five  models  performed  considerably  well,  each  showing 
improvement  over  the  original  non-weighted  models.  Most 
notably,  VGG  16  performed  best,  accounting  for  84%  of  the 
variance.  Training  using  the  estimated  regularization  param¬ 
eter  on  the  entire  dataset  yielded  an  R2  of  91%.  In  contrast, 
all  three  baseline  models  explained  essentially  no  variance 
(R2<0  .01),  suggesting  that  our  results  were  not  spurious  cor¬ 
relations  resulting  from  our  large  sets  of  predictors. 

Crucially,  the  MDS  solution  for  the  improved  predictions 
is  almost  identical  to  the  original  human  spatial  represen¬ 
tation.  The  same  improvements  were  found  in  hierarchical 
clusterings  of  actual  and  predicted  similarity  matrices  (1st 
and  3rd  panels  of  Figure  3),  this  time  largely  in  the  form  of 
top-level  parent  nodes. 

Feature  Analysis.  While  higher  layers  in  CNNs  tend  to  pro¬ 
duce  the  most  generic  high-level  features  for  domain  transfer 
across  image  applications,  the  choice  of  feature  depth  is  ulti¬ 
mately  dependent  on  the  task  (Sainath,  Kingsbury,  Mohamed, 
&  Ramabhadran,  2013).  This  implies  that  layer  responses 
at  different  depths  may  explain  different  types  of  human 
similarity  judgments  (e.g.  tasks  that  involve  comparing 
visual  features  versus  conceptual  information).  We  examined 


CaffeNet  Layer 


Figure  4:  Model  performance  as  a  function  of  feature  layer 
depth  in  CaffeNet. 

our  model’s  performance  in  predicting  similarity  judgments 
as  a  function  of  feature  depth  using  CaffeNet,  given  its  more 
straightforward  architecture  and  manageably-sized  layers. 
Specifically,  we  compared  performance  across  the  last 
three  convolutional  layers  and  the  last  two  fully-connected 
layers.  The  results  are  shown  in  Figure  4.  Performance  does 
appear  to  correspond  strongly  to  layer  depth,  although  fully 
connected  layers  perform  much  better  than  convolutional 
layers,  suggesting  that  human  similarity  judgments  may  not 
be  explained  well  from  simpler  image  features. 

Reweighted  Classification.  We  investigated  the  effect  of  our 
fine-tuned  representations  on  a  separate  animal  classifier,  us¬ 
ing  a  new  animal  data  set  consisting  of  1,740  images  from  19 
animal  classes  (bear,  cougar,  cow,  coyote,  deer,  elephant,  gi¬ 
raffe,  goat,  gorilla,  horse,  kangaroo,  leopard,  lion,  panda, 
penguin,  sheep,  skunk,  tiger,  zebra)  (Afkham,  Targhi,  Ek- 
lundh,  &  Pronobis,  2008).  We  used  multinomial  logistic  re¬ 
gression  with  6-fold  cross-validation  to  classify  animals  using 
fine-tuned  representations  as  predictors.  We  fine-tuned  these 
representations  by  pairwise  multiplying  the  original  VGG16 
representations  with  the  square-root  of  the  weights  obtained 
through  prediction  of  the  human  similarity  data.  However, 
because  some  of  the  weights  of  the  original  solution  are  neg¬ 
ative,  we  used  Elastic  Net  regression  to  solve  for  weights  con¬ 
strained  to  be  positive.  We  ran  the  same  model  using  the  orig¬ 
inal  unaltered  VGG16  representation  to  serve  as  baseline  per¬ 
formance.  The  original  model  performed  very  well  (average 
R2  =  .94),  whereas  the  fine-tuned  model  performed  consis¬ 
tently  worse  (R2  =  .89). 

Discussion 

This  analysis  constitutes  the  first  formal  comparison  of  deep 
representations  to  human  psychological  representations.  Ini¬ 
tial  results  using  currently  high-performing  convolutional 
neural  networks  show  that  the  two  representations  were  mod¬ 
erately  correlated,  but  diverge  in  terms  of  crucial  structural 
characteristics,  a  problem  exhibited  by  similar  experiments 


using  neural  representations  as  opposed  to  deep  features  (Mur 
etal.,2013). 

Our  method  of  overcoming  this  problem,  by  a  parsimo¬ 
nious  adjustment  of  the  feature  representation  inspired  by  a 
classic  model  of  similarity,  appears  to  have  been  largely  suc¬ 
cessful.  Indeed,  the  human  representations  were  almost  com¬ 
pletely  reconstructed  by  our  adjusted  CNN  features.  Using 
features  extracted  from  deep  convolutional  neural  networks 
provides  an  opportunity  to  estimate  psychological  represen¬ 
tations  from  real,  raw  sensory  inputs  (e.g.  pixels).  How¬ 
ever,  one  potential  limitation  of  this  work  is  the  generalizabil- 
ity  of  the  transformation  acquired  to  broader  stimulus  con¬ 
texts.  Testing  this  question  will  require  replication  and  trans¬ 
fer  across  several  domains.  To  the  extent  that  this  can  be 
established,  we  envision  our  method  as  a  standard  tool  for 
studying  cognitive  science  using  natural  stimulus  sets,  on  par 
with  modern  artificial  intelligence. 

Beyond  this,  we  see  potential  for  such  an  interface  be¬ 
tween  cognitive  science  and  artificial  intelligence  to  be  ex¬ 
ploited  for  the  benefit  of  each.  While  our  attempt  to  improve 
a  common  categorization  objective  in  computer  vision  (i.e. 
one-versus-all  classification)  using  human-tuned  representa¬ 
tions  was  not  successful,  it  does  raise  interesting  distinctions 
between  the  computational  problems  solved  by  humans  and 
CNNs.  After  all,  the  full  breadth  of  human  categorization  be¬ 
havior  exhibits  complex  patterns  such  as  overlapping  class  as¬ 
signments,  which  are  not  likely  to  be  well-represented  when 
the  learning  objective  is  defined  through  images  and  objects 
characterized  by  a  single  label.  Further,  one  might  ask  if  poor 
categorization  performance  of  the  one-versus-all  kind  is  the 
price  paid  for  a  more  flexible  system  of  categorization  with 
respect  to  a  set  of  complex  objects  that  can  be  partitioned  us¬ 
ing  several  “good”  configurations,  depending  on  the  context 
and  task  at  hand.  Given  this  possibility,  one  should  be  care¬ 
ful  not  to  equate  CNN  classification  performance  with  human 
categorization  abilities  in  general. 
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